Deletion variant calling in third-generation sequencing data based on a dual-attention mechanism

Abstract Deletion is a crucial type of genomic structural variation and is associated with numerous genetic diseases. The advent of third-generation sequencing technology has facilitated the analysis of complex genomic structures and the elucidation of the mechanisms underlying phenotypic changes and disease onset due to genomic variants. Importantly, it has introduced innovative perspectives for deletion variant calling. Here we propose a method named Dual Attention Structural Variation (DASV) to analyze deletion structural variations in sequencing data. DASV converts gene alignment information into images and integrates them with genomic sequencing data through a dual attention mechanism. Subsequently, it employs a multi-scale network to precisely identify deletion regions. Compared with four widely used genome structural variation calling tools (cuteSV, SVIM, Sniffles and PBSV), DASV consistently achieves a balance between precision and recall, improving the F1 score across various datasets. The source code is available at https://github.com/deconvolution-w/DASV.


Index.
samtools index ./sort-data.bam

Depth.
samtools depth ./sort-data.bam > ./sort-data.bam.txt

Callers

cuteSV
cuteSV is a sensitive, fast, and scalable long-read-based structural variation (SV) detection approach. It uses tailored methods to comprehensively collect the signatures of various types of SVs, and a clustering-and-refinement method to analyze these signatures and implement stepwise, highly sensitive SV detection. Benchmarks on simulated and real long-read sequencing datasets demonstrate that cuteSV achieves higher yields and better scaling performance than state-of-the-art tools.

cuteSV --min_read_len 150 --min_support 4 --min_size 25 ./sort-data.bam ./data.fa ./benchmark/cuteSV1.vcf ./benchmark/

SNIFFLES
Sniffles is a fast structural variant caller for long-read sequencing. It accurately detects structural variants (SVs) at the germline, somatic, and population level for PacBio and Oxford Nanopore read data. Sniffles is used to call SVs from long-read alignments (PacBio/ONT).

sniffles -s 3 -d 200 -t 4 -l 20 -r 100 -m ./sort-data.bam -v ./sniffles.vcf

SVIM
SVIM, which stands for Structural Variant Identification Method, is a structural variant caller for third-generation sequencing reads. It is capable of detecting and classifying six classes of structural variation: deletions, insertions, inversions, tandem duplications, interspersed duplications, and translocations. Unlike other methods, SVIM integrates information from across the genome to precisely distinguish similar events, such as tandem and interspersed duplications and simple insertions.

svim alignment --min_sv_size 30 --minimum_depth 3 ./benchmark/ ./sort-data.bam ./data.fa

PBSV
PBSV is a suite of tools to call and analyze structural variants in diploid genomes from PacBio single-molecule real-time sequencing (SMRT) reads. It calls insertions, deletions, inversions, duplications, and translocations. Both single-sample calling and joint (multi-sample) calling are provided. PBSV is most effective for insertions of 20 bp to 10 kb, deletions of 20 bp to 100 kb, inversions of 200 bp to 10 kb, duplications of 20 bp to 10 kb, and translocations between different chromosomes or further than 100 kb apart on a single chromosome.

pbsv discover -m 40 -y 10 -w 260 -a 25 -k 80 ./data.bam ./data.bam.svsig.gz
pbsv call ./data.fa ./data.bam.svsig.gz ./pbsv.vcf

CombiSV

perl combiSV2.0.pl -pbsv ./pbsv.vcf -sniffles ./sniffles.vcf -cutesv ./cuteSV.vcf -svim ./svim.vcf -c 3 -o 1

Simulate
Generate .fasta:

./SURVIVOR simSV parameter_file
./SURVIVOR simSV ./chr2.fa parameter_file 0.1 0 ./simu_data

Install PaSS and run PaSS:

gcc PaSS.c -o PaSS -lm -lpthread
perl ./PaSS/pacbio_mkindex.pl ./simu_data.fasta ./simulation/
./PaSS -list percentage.txt -index index -m pacbio_RS -c ./sim.config

Evaluation metrics
These metrics (Precision, Recall, F1-score) are usually used in binary classification problems, where the positive examples belong to the target category of interest and the negative examples belong to all other categories. First, TP, FP, TN and FN are four important counts used to evaluate the performance of classification models:
• TP (True Positive) is the number of positive samples that the model correctly predicts as positive.
• FP (False Positive) is the number of negative samples that the model incorrectly predicts as positive.
• TN (True Negative) is the number of negative samples that the model correctly predicts as negative.
• FN (False Negative) is the number of positive samples that the model incorrectly predicts as negative.

Precision
Precision is a measure of how many of the positive predictions made are correct (true positives). Formula: Precision = TP / (TP + FP)

Recall
Recall is a measure of how many of the positive cases the classifier correctly predicts out of all the positive cases in the data. It is sometimes also referred to as sensitivity. Formula: Recall = TP / (TP + FN)

F1-Score
F1-Score is a measure that combines precision and recall: it is the harmonic mean of the two, taking into account both the accuracy and the recall ability of the model. These metrics are commonly used to evaluate the performance of classification models when positive and negative samples are imbalanced, where Precision is concerned with the accuracy of the model and Recall with its comprehensiveness, while F1-score combines both. Formula: F1-Score = 2 × Precision × Recall / (Precision + Recall)
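To make the definitions concrete, the following Python sketch computes all three metrics from binary labels; the toy label vectors are invented purely for illustration.

def precision_recall_f1(y_true, y_pred):
    # Count the confusion-matrix entries defined above.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 3 TP, 1 FN, 1 FP -> precision = recall = F1 = 0.75.
print(precision_recall_f1([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0]))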

One hot coding
If there are N label types in total, one-hot coding is constructed by building a dictionary that maps the elements one-to-one to the integers 0 to N − 1, so that the integer i assigned to each element is its index. To represent an element, a vector of length N is used in which the i-th position is set to 1 and all other positions are set to 0. The resulting vector can then be fed into the neural network for training.
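As a minimal Python sketch of this procedure, assume N = 4 label types; the A/C/G/T label set is illustrative, not prescribed by the text.

import numpy as np

labels = ["A", "C", "G", "T"]                         # N = 4 label types
index = {label: i for i, label in enumerate(labels)}  # element -> 0..N-1

def one_hot(label, n=len(labels)):
    vec = np.zeros(n, dtype=np.float32)  # length-N vector of zeros
    vec[index[label]] = 1.0              # 1 at the element's index i
    return vec

print(one_hot("G"))  # [0. 0. 1. 0.]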

Epoch and Batch
An epoch means that all of the data has been fed through the network once, completing one forward calculation and one backpropagation pass. Since an epoch is often too large to process at once, it is divided into several small batches. Training on all the data a single time is not enough; the process has to be repeated several times for the model to fit and converge. In practice, all of the data is divided into several batches and the network is fed one batch at a time. As the number of epochs increases, the number of weight-update iterations grows, and the model moves from an underfit state toward an optimally fitted state.
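The following PyTorch sketch shows one possible epoch/batch loop; the toy dataset, linear model, batch size of 32 and ten epochs are placeholder choices, not settings taken from DASV.

import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))
loader = DataLoader(data, batch_size=32, shuffle=True)  # batches of 32 samples
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(10):              # one epoch = one pass over all the data
    for x, y in loader:              # feed one batch at a time
        loss = loss_fn(model(x), y)  # forward calculation
        optimizer.zero_grad()
        loss.backward()              # backpropagation
        optimizer.step()             # one weight-update iteration per batch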

CNN: convolutional neural networks
CNNs are a type of feed-forward neural network that learn features on their own by optimizing their filters (or kernels). They are designed to automatically and adaptively learn spatial hierarchies of features from the input data. The convolutional layer is the core building block of a CNN. The layer's parameters consist of a set of learnable filters (or kernels). These filters are small spatially (along width and height) but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the entries of the filter and the input and producing a 2-dimensional activation map for that filter. As a result, the network learns filters that activate when they detect a specific type of feature at some spatial position in the input.
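As an illustration, the short PyTorch sketch below builds a single convolutional layer with eight learnable 3x3 filters that extend through the full depth of a 3-channel input; all shapes are invented for the example.

import torch

conv = torch.nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
x = torch.randn(1, 3, 64, 64)  # batch of one 3-channel 64x64 input volume
maps = conv(x)                 # each filter yields one 2-D activation map
print(maps.shape)              # torch.Size([1, 8, 64, 64])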

Global Average Pooling and Global Max Pooling
Global Average Pooling (GAP): This operation calculates the average value of each feature map separately. It is designed to replace fully connected layers in classical CNNs: the idea is to generate one feature map for each category of the classification task in the last layer. Instead of adding fully connected layers on top of the feature maps, the average of each feature map is taken and the resulting vector is fed directly into the softmax layer. One advantage of GAP over fully connected layers is that it is more native to the convolutional structure, enforcing correspondences between feature maps and categories. Another advantage is that GAP has no parameters to optimise, which avoids overfitting in this layer. Furthermore, GAP sums up the spatial information and is therefore more robust to spatial translations of the input.
Global Max Pooling (GMP): This operation calculates the maximum value of each feature map over its entire spatial extent. It is similar to GAP, but instead of taking the average it takes the maximum value of each feature map. GMP is commonly used to convert convolutional features of variable-size images into a fixed-size embedding. Note that both GAP and GMP are computed independently for each activation map: every map is pooled over its whole spatial extent, so activations from different locations are merged together.
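The contrast between the two operations can be sketched in a few lines of PyTorch; the eight 16x16 feature maps are arbitrary example shapes.

import torch

x = torch.randn(1, 8, 16, 16)  # 8 feature maps of size 16x16
gap = x.mean(dim=(2, 3))       # GAP: average of each map -> shape (1, 8)
gmp = x.amax(dim=(2, 3))       # GMP: maximum of each map -> shape (1, 8)
print(gap.shape, gmp.shape)

Either pooled vector has one entry per feature map, which is what allows GAP to stand in for a fully connected layer in front of the softmax.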

ResNet
The key feature of ResNet is the use of residual modules and residual connections to build the network, allowing much deeper networks to be trained without suffering from vanishing gradients. Specifically, ResNet introduces the shortcut connection, which adds a cross-layer connection to each residual module so that information can be passed directly to later layers, preserving the original features and preventing them from fading away layer by layer.
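A minimal PyTorch sketch of a residual module with a shortcut connection follows; the channel count is illustrative, and real ResNet blocks additionally use a projection shortcut when the stride or width changes.

import torch

class ResidualBlock(torch.nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = torch.nn.BatchNorm2d(channels)
        self.conv2 = torch.nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = torch.nn.BatchNorm2d(channels)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # shortcut: pass the input directly forward

y = ResidualBlock(8)(torch.randn(1, 8, 16, 16))  # same shape in and out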

LeakyReLU
LeakyReLU (Leaky Rectified Linear Unit) is an activation function based on ReLU, but with a small nonzero slope for negative inputs rather than a flat zero slope. This slope coefficient is fixed before training, i.e. it is not learned during the training process.
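Concretely, LeakyReLU(x) = x for x >= 0 and slope · x for x < 0. The PyTorch lines below use a slope of 0.01, a common default rather than anything specific to this work.

import torch

act = torch.nn.LeakyReLU(negative_slope=0.01)  # fixed slope, not learned
print(act(torch.tensor([-2.0, 0.0, 3.0])))     # tensor([-0.0200, 0.0000, 3.0000])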

Sigmoid
The sigmoid function is a mathematical function with a characteristic S-shaped (sigmoid) curve. It is defined for all real input values, has a non-negative derivative at every point, and has exactly one inflection point.
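The standard logistic sigmoid is sigmoid(x) = 1 / (1 + exp(-x)); as a quick sanity check in PyTorch:

import torch

x = torch.tensor([-4.0, 0.0, 4.0])
print(torch.sigmoid(x))  # tensor([0.0180, 0.5000, 0.9820]); inflection at x = 0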

Batch Normalization
Batch normalisation, also known as batch norm, is a technique used to improve the training of deep neural networks. It was proposed in 2015 by Sergey Ioffe and Christian Szegedy. The main idea behind batch normalisation is to normalise the inputs of each layer so that they have a mean output activation of zero and a standard deviation of one. This is done for each mini-batch, hence the name "batch normalisation".
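A minimal PyTorch sketch of this effect, using an arbitrary mini-batch of 32 samples with 4 features each:

import torch

bn = torch.nn.BatchNorm1d(4)        # one mean/std pair per feature
x = torch.randn(32, 4) * 5 + 10     # mini-batch with mean ~10 and std ~5
y = bn(x)                           # training mode: normalise over the batch
print(y.mean(dim=0), y.std(dim=0))  # per feature: roughly 0 and roughly 1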

Attention mechanism
The attention mechanism is a pivotal concept in machine learning, particularly in natural language processing and neural networks. Understood intuitively, the mechanism matches its name well: its core logic is to move from attending to everything to attending to the focus. As a simple example, suppose an image contains a dog and we want a neural network to determine whether the image shows a dog or a cat. Normally, the background information of the image would affect the network's judgment, but the attention mechanism allows the algorithm to focus on the region where the dog is located.
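As a concrete instance, the sketch below implements scaled dot-product attention, one common form of the mechanism; it is illustrative only and is not the dual-attention module used in DASV.

import torch

def attention(q, k, v):
    # Similarity of each query to each key, scaled by sqrt(dim).
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)  # where to focus; rows sum to 1
    return weights @ v                       # weighted sum of the values

q = torch.randn(1, 5, 16)  # 5 positions, feature dimension 16
k = torch.randn(1, 5, 16)
v = torch.randn(1, 5, 16)
print(attention(q, k, v).shape)  # torch.Size([1, 5, 16])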

Fig. 2. One epoch. We will loop this process N times until the network converges.

Fig. 4. Example of Global Average Pooling and Global Max Pooling.